Introduction


The Bechdel test aims at measuring the representation of women in movies.
When it was created, the rule was simple: a movie passes the test if it features at least two women who talk to each other about something other than a man. More recently, fivethirtyeight.com imagined new rules (e.g. “every department has two or more women”).

It is now my turn to imagine a rule assessing the representation of women in a specific industry. I decided to work on political journalism in France, using the following rule: “At least 50% of the articles on politics were written by women”.

In this analysis, three French newspapers are scrutinized under this rule: Le Monde, L’Humanité, and L’Opinion. Who wins? Who loses? Let’s find out!



Data preparation


Methodology


How did I collect the names of the political journalists?
I scraped (i.e. extracted data from webpages) the websites of the newspapers, using the library rvest.
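
As a minimal sketch of what this scraping step looks like, the snippet below parses a small inline HTML fragment instead of a live page. The fragment and the two author names are made up for illustration; only the `auteur` class mirrors the XPath used for Le Monde later in this post.

```r
library(rvest)

# Hypothetical HTML fragment standing in for a newspaper feed page
page <- read_html('<html><body>
  <span class="auteur">Jane Doe</span>
  <span class="auteur">John Roe</span>
</body></html>')

# Select every node carrying the "auteur" class and keep its text
authors <- page %>%
  html_nodes(xpath = './/*[@class="auteur"]') %>%
  html_text()

print(authors)
```

On a real site the same two calls are applied to each page of the feed, which is what the `scrap_names` function defined further down does.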

Why did I pick Le Monde, L’Humanité and L’Opinion?
My criteria were the following:

  • The newspaper is French
  • The newspaper has a Political section
  • The newspaper displays the names of the journalists in the feed

I also considered the following newspapers, but ruled them out for the reasons below:

  • Les Echos: rarely displays the names of the journalists
  • Le Parisien: impossible to scrape (error 404)
  • Le Figaro: does not display the names of the journalists in the feed
  • Libération: no clear political section
  • Slate.fr: too complicated to scrape (loading more articles requires clicking a “More articles” button at the end of the page)

How did I determine the gender of the journalists?
I used the “Gender by Name” dataset from data.world (credits to Derek Howard), which maps first names to a gender, and added by hand a few French first names it was missing.
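
The lookup itself boils down to a left join between the journalists' first names and the name-to-gender table. Here is a small sketch with a toy lookup table: the column names match the ones used later in this post, `Violaine` and `Geoffroy` come from the manually added names, and `Unknownname` is a hypothetical first name absent from the table.

```r
# Toy version of the name-to-gender table (the real one comes from data.world)
gender_lookup <- data.frame(name = c("Violaine", "Geoffroy"),
                            gender = c("F", "M"),
                            probability = c(1, 1),
                            stringsAsFactors = FALSE)

# First names extracted from the bylines; the last one is hypothetical
journalists <- data.frame(first_names = c("Violaine", "Geoffroy", "Unknownname"),
                          stringsAsFactors = FALSE)

# Left join: all.x = TRUE keeps journalists whose name is not in the table
matched <- merge(journalists, gender_lookup,
                 by.x = "first_names", by.y = "name",
                 all.x = TRUE)

# Names without a match get NA as gender and can be reviewed or dropped
unmatched <- matched[is.na(matched$gender), ]
print(matched)
```

Rows left with an `NA` gender are the ones that either need a manual addition to the table or have to be excluded from the analysis.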

How many articles did I analyze?
I analyzed all the articles that were published between September 2017 and October 2018 in the Political sections of the websites. This led me to the following numbers of articles:

  • Le Monde: 4,116 articles were scraped, 4,116 were used in the analysis
  • L’Humanité: 1,608 articles were scraped, 1,576 were used in the analysis
  • L’Opinion: 2,689 articles were scraped, 2,049 were used in the analysis

Note that the bylines were not always usable (for instance, some articles were signed “L’Opinion Vidéo”, and others carried only initials or a first name missing from the gender dataset). These articles were removed from the analysis, hence the difference between the numbers of articles scraped and analyzed.
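
To put these exclusions in perspective, the short computation below derives, from the counts listed above, how many articles were dropped and what share was kept for each newspaper. The variable names are mine, not part of the original pipeline.

```r
# Counts taken from the list above
articles <- data.frame(newspaper = c("Le Monde", "L'Humanité", "L'Opinion"),
                       scraped   = c(4116, 1608, 2689),
                       analyzed  = c(4116, 1576, 2049),
                       stringsAsFactors = FALSE)

# Articles dropped because no usable byline or gender could be found
articles$excluded <- articles$scraped - articles$analyzed

# Share of scraped articles that made it into the analysis
articles$share_kept <- round(articles$analyzed / articles$scraped, 3)

print(articles)
```

L’Opinion is the newspaper most affected by this filtering, which is worth keeping in mind when reading its results.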

Preparing the scraping

library(rvest)
library(magrittr)
library(dplyr)
library(tidyr)
library(gsubfn)
library(knitr)
library(kableExtra)
library(plotly)
Sys.setlocale(locale="en_us.UTF-8")
## [1] "en_us.UTF-8/en_us.UTF-8/en_us.UTF-8/C/en_us.UTF-8/C"
# Downloading a dataset giving, for each name, the gender
gender_by_name <- read.csv("https://query.data.world/s/2laxunthmvqedzzkelp4wdgx5xulum", 
                           header = TRUE,
                           stringsAsFactors = FALSE)
# Source: https://data.world/howarder/gender-by-name

# Adding some missing names and genders
# (built column by column so that "probability" stays numeric)
name_to_add <- data.frame(
  name = c("Aurelien", "Francette", "Nehla", "Pierric", "Violaine",
           "Aurane", "Brune", "Geoffroy", "Ivanne", "Mahasti",
           "Pieyre", "Flavie", "Gaidz", "Rafaele", "Soazig"),
  gender = c("M", "F", "F", "M", "F",
             "F", "F", "M", "F", "F",
             "M", "F", "M", "F", "F"),
  probability = 1,
  stringsAsFactors = FALSE)

gender_by_name <- rbind(gender_by_name, name_to_add)
# Creating a function that extracts the journalists' names from a specific webpage
scrap_names <- function(url, xpath) {
  html <- read_html(url)
  Sys.sleep(sample(10, 1) * 0.1) # Let's try to look more human: we set some random waiting time between each url
  return(html %>% 
           html_nodes(xpath = xpath) %>% 
           html_text())
}

# Creating a function that removes the accents
unwanted_array <- list('Š'='S', 'š'='s', 'Ž'='Z', 'ž'='z', 'À'='A', 'Á'='A', 'Â'='A', 'Ã'='A', 'Ä'='A','Å'='A', 'Æ'='A', 'Ç'='C', 'È'='E', 'É'='E','Ê'='E', 'Ë'='E', 'Ì'='I', 'Í'='I', 'Î'='I', 'Ï'='I', 'Ñ'='N', 'Ò'='O', 'Ó'='O', 'Ô'='O', 'Õ'='O', 'Ö'='O', 'Ø'='O', 'Ù'='U', 'Ú'='U', 'Û'='U', 'Ü'='U', 'Ý'='Y', 'Þ'='B', 'ß'='Ss', 'à'='a', 'á'='a', 'â'='a', 'ã'='a', 'ä'='a', 'å'='a', 'æ'='a', 'ç'='c', 'è'='e', 'é'='e', 'ê'='e', 'ë'='e', 'ì'='i', 'í'='i', 'î'='i', 'ï'='i', 'ð'='o', 'ñ'='n', 'ò'='o', 'ó'='o', 'ô'='o', 'õ'='o', 'ö'='o', 'ø'='o', 'ù'='u', 'ú'='u', 'û'='u', 'ü'='u', 'ý'='y', 'þ'='b', 'ÿ'='y' )

remove_accents <- function(name) {
  gsubfn(paste(names(unwanted_array),collapse='|'), unwanted_array, name)
}

# Creating a function that extracts the first name (or the first part of the first name if it is a compound name)
extract_first_name <- function(name) {
  first_name = remove_accents(substr(name, 1, regexpr(" ", name)-1))
  first_name = gsub(" ", "", case_when(regexpr("-", first_name) != -1 ~ substr(first_name, 1, regexpr("-", first_name)-1),
                                       TRUE ~ first_name)) # if it is a composed name, let's just keep the first part
  return(first_name)
}
# Defining the URLs to scrape and specifying where to find the names of the journalists in these pages
base_url_lemonde <- 'https://www.lemonde.fr/politique/'
## I need to break down Le Monde's URLs into three vectors, because Le Monde seems to limit scraping -> requesting too many pages at once generates an error
list_url_lemonde_1 <- paste0(base_url_lemonde, c(1:100), ".html")
list_url_lemonde_2 <- paste0(base_url_lemonde, c(101:200), ".html")
list_url_lemonde_3 <- paste0(base_url_lemonde, c(201:300), ".html") 
xpath_lemonde <- './/*[@class="auteur"]'

base_url_humanite <- 'https://www.humanite.fr/politique?page='
list_url_humanite <- paste0(base_url_humanite, c(1:200))
xpath_humanite <- './/*[@class="field field-name-field-news-auteur field-type-node-reference field-label-hidden"]'

base_url_lopinion <- 'https://www.lopinion.fr/edition/politique/index/page/'
list_url_lopinion <- paste0(base_url_lopinion, c(1:90), "/0")
xpath_lopinion <- './/*[@class="article-snippet_author"]'

Scraping Le Monde

# Extracting the names from all the pages
names_lemonde_1 <- list_url_lemonde_1 %>% 
  mapply(scrap_names, 
         ., 
         MoreArgs = list(xpath_lemonde),
         SIMPLIFY = "array") %>% 
  unlist(use.names = FALSE)

names_lemonde_1 %>% write.csv('names_lemonde_1.csv')


names_lemonde_2 <- list_url_lemonde_2 %>% 
  mapply(scrap_names, 
         ., 
         MoreArgs = list(xpath_lemonde),
         SIMPLIFY = "array") %>% 
  unlist(use.names = FALSE)

names_lemonde_2 %>% write.csv('names_lemonde_2.csv')


names_lemonde_3 <- list_url_lemonde_3 %>% 
  mapply(scrap_names, 
         ., 
         MoreArgs = list(xpath_lemonde),
         SIMPLIFY = "array") %>% 
  unlist(use.names = FALSE)

names_lemonde_3 %>% write.csv('names_lemonde_3.csv')
# Loading the list of names
names_lemonde <- read.csv('names_lemonde_1.csv', 
                          col.names = c("x", "raw_names"), 
                          colClasses = c("character")) %>% 
  rbind(read.csv('names_lemonde_2.csv', 
                          col.names = c("x", "raw_names"), 
                          colClasses = c("character"))) %>% 
  rbind(read.csv('names_lemonde_3.csv', 
                          col.names = c("x", "raw_names"), 
                          colClasses = c("character"))) %>% 
  select(-x)

# Computing the number of articles scraped for the "Methodology" section
methodology_articles <- data.frame(newspaper = "Le Monde",
                                   n_scrapped_articles = nrow(names_lemonde),
                                   stringsAsFactors=FALSE)

# Computing the number of articles per journalist, and isolating their first names
names_lemonde <- names_lemonde %>%
  group_by(raw_names) %>% 
  summarize(n_articles = n()) %>%
  ungroup() %>% 
  mutate(first_names = extract_first_name(raw_names),
         id = row_number())

# Storing the results in a data frame
df_lemonde <- names_lemonde %>% 
  merge(gender_by_name, 
        by.x = "first_names", 
        by.y = "name", 
        all.x = TRUE)

df_lemonde <- df_lemonde[c(4,2,1,5,6,3)]

df_lemonde %>% filter(is.na(gender))
## [1] id          raw_names   first_names gender      probability n_articles 
## <0 rows> (or 0-length row.names)
# Adding the number of articles analyzed to the methodology_articles df
methodology_articles <- methodology_articles %>% 
  mutate(n_analyzed_articles = df_lemonde$n_articles %>% sum())

Scraping L’Humanité

# Extracting the names from all the pages
names_humanite <- list_url_humanite %>%
  mapply(scrap_names, 
         ., 
         MoreArgs = list(xpath_humanite),
         SIMPLIFY = "array") %>% 
  unlist(use.names = FALSE)

names_humanite %>% write.csv('names_humanite.csv')
names_humanite <- read.csv('names_humanite.csv', col.names = c("x", "raw_names")) %>% 
  select(-x)

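# Filling methodology_articles df for L'Humanité
# (adding a typed row rather than a character vector, so the count columns are not coerced to text)
methodology_articles <- methodology_articles %>% 
  rbind(data.frame(newspaper = "L'Humanité",
                   n_scrapped_articles = nrow(names_humanite),
                   n_analyzed_articles = 0L,
                   stringsAsFactors = FALSE))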

# Data Cleaning
##  1. Some rows contain several names (separated by a comma, "et", or "avec") -> Let's isolate them and extract all the names
names_humanite <- names_humanite %>%
  filter(!grepl(",", raw_names)) %>%  # Rows with a "," are rare (4 rows) and tedious to parse -> let's remove them
  mutate(name_1 = case_when(grepl(" et ", raw_names) ~ substr(raw_names, 1, regexpr(" et ", raw_names)-1),
                            grepl(" avec ", raw_names) ~ substr(raw_names, 1, regexpr(" avec ", raw_names)-1),
                            TRUE ~ as.character(raw_names)),
         name_2 = case_when(grepl(" et ", raw_names) ~ substr(raw_names, regexpr(" et ", raw_names)+4, 100),
                            grepl(" avec ", raw_names) ~ substr(raw_names, regexpr(" avec ", raw_names)+6, 100)))

### We still have some rows with multiple names. Let's quickly fix it:
names_humanite <- names_humanite %>% 
  mutate(
    name_3 = case_when(
      rownames(.) %in% grep("avec", name_1) ~ case_when(
        rownames(.) %in% grep("avec ", name_1) ~ substr(name_1, regexpr("avec ", name_1)+5, 100))),
    
    name_1 = case_when(
      rownames(.) %in% grep("avec", name_1) ~ case_when(
        rownames(.) %in% grep("avec ", name_1) ~ substr(name_1, 1, regexpr("avec ", name_1)-1)),
      TRUE ~ name_1))


##  2. Now, let's create a new dataframe with the list of names from the 3 columns.
list_names_1 <- names_humanite %>% 
  filter(regexpr("\\.", name_1) == -1) %>% # Removing abbreviations (e.g. "G.M.")
  select(name_1) %>% 
  as.list()

list_names_2 <- names_humanite %>% 
  filter(regexpr("\\.", name_2) == -1) %>%# Removing abbreviations (e.g. "G.M.")
  select(name_2) %>%
  filter(is.na(name_2) == FALSE)

list_names_3 <- names_humanite %>% 
  filter(regexpr("\\.", name_3) == -1) %>% # Removing abbreviations (e.g. "G.M.")
  select(name_3) %>% 
  filter(is.na(name_3) == FALSE)

list_names <- list_names_1 %>%
  append(list_names_2) %>% 
  append(list_names_3) %>% 
  unlist()

df_humanite <- data.frame(raw_names = list_names, row.names = c()) # Tadam!


# Data preparation
## 1. Computing the number of articles per journalist, and isolating their first names
df_humanite <- df_humanite %>%
  group_by(raw_names) %>% 
  summarize(n_articles = n()) %>%
  ungroup() %>% 
  mutate(first_names = extract_first_name(raw_names),
         id = row_number())

## 2. Storing the results in a dataframe
df_humanite <- df_humanite %>% 
  merge(gender_by_name, 
        by.x = "first_names", 
        by.y = "name", 
        all.x = TRUE) %>% 
  filter(!(is.na(gender)))

df_humanite <- df_humanite[c(4,2,1,5,6,3)]

# Note that there are some misspellings: Laurène Bureau became Lauren bureau, Ludovic Finez became ludo finez, etc.

# Adding the number of articles to the methodology_articles df
methodology_articles <- methodology_articles %>% 
  mutate(n_analyzed_articles = case_when(newspaper == "L'Humanité" ~ sum(df_humanite$n_articles),
                                         TRUE ~ as.integer(n_analyzed_articles)))

Scraping L’Opinion

# Extracting the names from all the pages
names_lopinion <- list_url_lopinion %>%
  mapply(scrap_names, 
         ., 
         MoreArgs = list(xpath_lopinion),
         SIMPLIFY = "array") %>% 
  unlist(use.names = FALSE)

names_lopinion %>% write.csv('names_lopinion.csv')
names_lopinion <- read.csv('names_lopinion.csv', col.names = c("x", "raw_names")) %>% 
  select(-x)

# Filling methodology_articles df for L'Opinion
methodology_articles <- methodology_articles %>% 
  rbind(data.frame(newspaper = "L'Opinion",
                   n_scrapped_articles = nrow(names_lopinion),
                   n_analyzed_articles = 0L,
                   stringsAsFactors = FALSE))

# Data Cleaning
##  1. Some rows contain several names (separated by a comma or "et") -> Let's isolate them and extract all the names
names_lopinion <- names_lopinion %>%
  filter(!grepl(",", raw_names)) %>%  # Rows with a "," are rare (8 rows) and tedious to parse -> let's remove them
  mutate(name_1 = case_when(grepl(" et ", raw_names) ~ substr(raw_names, 1, regexpr(" et ", raw_names)-1),
                            TRUE ~ as.character(raw_names)),
         name_2 = case_when(grepl(" et ", raw_names) ~ substr(raw_names, regexpr(" et ", raw_names)+4, 100)))


##  2. Now, let's create a new dataframe with the list of names from the 2 columns.
list_names_1 <- names_lopinion %>% 
  filter(regexpr("\\.", name_1) == -1) %>% # Removing abbreviations (e.g. "G.M.")
  select(name_1) %>% 
  as.list()

list_names_2 <- names_lopinion %>% 
  filter(regexpr("\\.", name_2) == -1) %>% # Removing abbreviations (e.g. "G.M.")
  select(name_2) %>%
  filter(is.na(name_2) == FALSE)

list_names <- list_names_1 %>%
  append(list_names_2) %>% 
  unlist()

df_lopinion <- data.frame(raw_names = list_names, row.names = c()) # Tadam!

# Data preparation
## 1. Computing the number of articles per journalist, and isolating their first names
df_lopinion <- df_lopinion %>%
  group_by(raw_names) %>% 
  summarize(n_articles = n()) %>%
  ungroup() %>% 
  mutate(first_names = extract_first_name(raw_names),
         id = row_number())

## 2. Let's now add the gender
df_lopinion <- df_lopinion %>% 
  merge(gender_by_name, 
        by.x = "first_names", 
        by.y = "name", 
        all.x = TRUE)

df_lopinion <- df_lopinion %>% 
  filter(!(is.na(gender)))

df_lopinion <- df_lopinion[c(4,2,1,5,6,3)]

# Adding the number of articles to the methodology_articles df
methodology_articles <- methodology_articles %>% 
  mutate(n_analyzed_articles = case_when(newspaper == "L'Opinion" ~ sum(df_lopinion$n_articles),
                                         TRUE ~ as.integer(n_analyzed_articles)))

Analyzing data

# 1. Computing the share of articles written by males vs females
s_articles_by_gender <- df_lopinion %>% 
  group_by(gender) %>% 
  summarize(n_articles_lopinion = sum(n_articles)) %>% 
  merge(df_humanite %>% 
          group_by(gender) %>% 
          summarize(n_articles_lhumanite = sum(n_articles)),
        by = "gender") %>% 
  merge(df_lemonde %>% 
          group_by(gender) %>% 
          summarize(n_articles_lemonde = sum(n_articles)),
        by = "gender")

s_articles_by_gender <- s_articles_by_gender %>% 
  mutate("# articles (L'Opinion)" = n_articles_lopinion,
         "% articles (L'Opinion)" = round(n_articles_lopinion / sum(n_articles_lopinion), 2),
         "# articles (L'Humanité)" = n_articles_lhumanite,
         "% articles (L'Humanité)" = round(n_articles_lhumanite / sum(n_articles_lhumanite),2),
         "# articles (Le Monde)" = n_articles_lemonde,
         "% articles (Le Monde)" = round(n_articles_lemonde / sum(n_articles_lemonde), 2)
         ) %>% 
  select(-n_articles_lopinion, -n_articles_lhumanite, -n_articles_lemonde)

# 2. Computing the share of female vs male journalists
s_journalists_by_gender <- df_lopinion %>% 
  group_by(gender) %>% 
  summarize(n_journalists_lopinion = n_distinct(id)) %>% 
  merge(df_humanite %>% 
          group_by(gender) %>% 
          summarize(n_journalists_lhumanite = n_distinct(id)),
        by = "gender") %>% 
  merge(df_lemonde %>% 
          group_by(gender) %>% 
          summarize(n_journalists_lemonde = n_distinct(id)))

s_journalists_by_gender <- s_journalists_by_gender %>% 
  mutate("# journalists (L'Opinion)" = n_journalists_lopinion,
         "% journalists (L'Opinion)" = round(n_journalists_lopinion / sum(n_journalists_lopinion), 2),
         "# journalists (L'Humanité)" = n_journalists_lhumanite,
         "% journalists (L'Humanité)" =  round(n_journalists_lhumanite / sum(n_journalists_lhumanite), 2),
         "# journalists (Le Monde)" = n_journalists_lemonde,
         "% journalists (Le Monde)" = round(n_journalists_lemonde / sum(n_journalists_lemonde), 2)) %>% 
  select(-n_journalists_lopinion, -n_journalists_lhumanite, -n_journalists_lemonde)
# VERY SMALL NUMBERS -> ARE THEY CORRECT?

# 3. Merging these two results
genders_in_journalism <- merge(s_journalists_by_gender, s_articles_by_gender, by = "gender")

#-> How can we explain that, at L'Opinion, 33% of the journalists are women while 56% of the articles were written by women? Is it due to a few female journalists, or is it a real trend?






Findings


Rule #1: “At least 50% of the political articles were written by women”

Only one of the three newspapers, L’Opinion, passes the test: 56% of its political articles were written by women (that is, 1,139 articles out of 2,049). As for the other two, Le Monde almost passes the test (48% of articles written by women), but L’Humanité is further from parity, with fewer than a third of its articles written by women (31%).



gender | # articles (L’Opinion) | % articles (L’Opinion) | # articles (L’Humanité) | % articles (L’Humanité) | # articles (Le Monde) | % articles (Le Monde)
F      | 1139                   | 56%                    | 485                     | 31%                     | 1987                  | 48%
M      | 910                    | 44%                    | 1091                    | 69%                     | 2129                  | 52%
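
For readers who want to check the rule by hand, here is a small sketch that recomputes the shares from the counts in the table above and applies the 50% threshold. The vector names are illustrative and not part of the original analysis code.

```r
# Counts taken from the table above
articles_by_women <- c("L'Opinion" = 1139, "L'Humanité" = 485, "Le Monde" = 1987)
articles_total    <- c("L'Opinion" = 2049, "L'Humanité" = 1576, "Le Monde" = 4116)

# Share of political articles written by women, per newspaper
share_women <- articles_by_women / articles_total

# Rule #1: the share must reach at least 50%
passes_rule_1 <- share_women >= 0.5

print(round(share_women, 2))
print(passes_rule_1)
```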



While these first results give us an idea of the share of articles written by female journalists, they don’t tell us whether women make up 50% of the journalists. Let’s use another rule to assess this.






Rule #2: “At least 50% of the political journalists are women”

gender | # journalists (L’Opinion) | % journalists (L’Opinion) | # journalists (L’Humanité) | % journalists (L’Humanité) | # journalists (Le Monde) | % journalists (Le Monde)
F      | 53                        | 33%                       | 35                         | 38%                        | 115                      | 51%
M      | 106                       | 67%                       | 56                         | 62%                        | 112                      | 49%

That’s a surprise: L’Opinion passed the first test but failed the second one, whereas Le Monde failed the first one and passed the second one.

Average number of articles per journalist

gender | L’Opinion | L’Humanité | Le Monde
F      | 21        | 14         | 17
M      | 9         | 19         | 19


The reason behind this is a difference in the average number of articles per journalist. At L’Opinion, for instance, women write on average more than twice as many articles as their male counterparts (21 articles per woman vs 9 per man). Conversely, women at Le Monde write slightly fewer articles than men (17 vs 19).
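
The link between the two rules can be made explicit with a little arithmetic: the share of articles written by women depends on both the number of female journalists and their average output. The sketch below works this out for L’Opinion, using only the counts reported in the tables above; the variable names are mine.

```r
# L'Opinion: counts taken from the tables above
n_women <- 53
n_men   <- 106
articles_women <- 1139
articles_men   <- 910

# Average number of articles per journalist, by gender
avg_women <- articles_women / n_women
avg_men   <- articles_men / n_men
ratio     <- avg_women / avg_men   # how many times more articles per woman

# Share of journalists vs share of articles
share_journalists_women <- n_women / (n_women + n_men)
share_articles_women <- (n_women * avg_women) /
  (n_women * avg_women + n_men * avg_men)

print(round(c(avg_women = avg_women, avg_men = avg_men, ratio = ratio), 1))
print(round(c(journalists = share_journalists_women,
              articles = share_articles_women), 2))
```

Women are a third of the journalists but write roughly 2.5 times as many articles each, which is enough to push their share of articles above 50%.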

But can we conclude that women tend to write more articles than men at L’Opinion?




To answer this question, let’s take a more detailed look at the number of articles written by L’Opinion’s journalists.
As you can see in the box plots below, most women and men wrote roughly the same number of articles (about 1 to 3). The difference in the averages thus comes from a small number of female journalists who wrote many more articles than the rest of the authors.

Indeed, the quantiles are pretty much the same for men and women, whereas women have more “high” outliers: for instance, 6 women (out of 53) wrote more than 100 articles, whereas only 3 men (out of 106) did. The average number of articles per female journalist is thus boosted by these few individuals.
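
To illustrate how a handful of prolific authors can inflate an average without moving the rest of the distribution, here is a toy example. The article counts below are entirely hypothetical and are not the scraped data; they only mimic the pattern described above.

```r
# Hypothetical article counts per journalist (NOT the real data)
women <- c(1, 1, 2, 2, 3, 3, 120, 150)  # two very prolific authors
men   <- c(1, 1, 2, 2, 3, 3, 4, 5)      # no extreme values

# The medians are identical: the "typical" journalist writes the same amount
median_women <- median(women)
median_men   <- median(men)

# The means differ widely, driven by the two outliers
mean_women <- mean(women)
mean_men   <- mean(men)

print(c(median_women = median_women, median_men = median_men))
print(c(mean_women = mean_women, mean_men = mean_men))
```

This is why comparing medians or quantiles, as the box plots do, is more informative here than comparing averages.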



One can thus hardly conclude that, at L’Opinion, women tend to write more articles than men.


Conclusions


What have we learnt?

Source: Observatoire des métiers de la presse

How could this analysis be improved?